    Application-Driven Synthesis of Energy-Efficient Reconfigurable-Precision Operators

    The increasing performance demands of emerging Internet of Things applications clash with the low energy budgets of end-nodes. Therefore, hardware operators able to reconfigure their computational precision at runtime are increasingly employed in these devices to obtain good-enough results at minimal energy cost. Among the many methods proposed to implement such operators, Dynamic Voltage and Accuracy Scaling (DVAS) is particularly promising due to its broad applicability and low overheads. However, a straightforward application of DVAS conflicts with the optimizations performed by classic EDA algorithms and does not yield the expected results. In this paper, we propose a novel synthesis algorithm for reconfigurable-precision circuits that allows DVAS to be integrated into a standard implementation flow. Moreover, we show how this algorithm can exploit information about the application, namely the frequency with which each precision is used, to further reduce the total energy consumption. Applying our method to the popular LeNet neural network for digit recognition, we reduce the energy due to Multiply-And-Accumulate (MAC) operations by 25% compared to a straightforward application of DVAS.
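
    As a rough illustration of the application-driven aspect, the Python sketch below shows how an expected-energy objective weighted by each precision's usage frequency could rank candidate reconfigurable-precision implementations. All numbers, candidate names, and the cost model itself are illustrative assumptions, not the paper's synthesis algorithm.

        # Hypothetical cost model: every value below is an illustrative assumption.
        # Fraction of MAC operations executed at each precision, e.g. profiled from
        # a LeNet inference workload (bit-width -> relative frequency).
        usage_freq = {16: 0.10, 8: 0.60, 4: 0.30}

        # Per-operation energy (pJ) of two hypothetical candidate netlists.
        candidates = {
            "plain_dvas": {16: 4.0, 8: 2.6, 4: 1.9},
            "app_driven": {16: 4.3, 8: 2.2, 4: 1.3},
        }

        def expected_energy(per_precision_energy, usage):
            """Average energy per MAC, weighted by how often each precision is used."""
            return sum(usage[p] * per_precision_energy[p] for p in usage)

        for name, costs in candidates.items():
            print(name, round(expected_energy(costs, usage_freq), 2), "pJ/MAC")
        print("selected:", min(candidates, key=lambda n: expected_energy(candidates[n], usage_freq)))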

    Empirical derivation of upper and lower bounds of NBTI aging for embedded cores

    In deeply scaled CMOS technologies, device aging causes transistor performance parameters to degrade over time. While reliable models are available to accurately assess these degradations for devices and circuits, extending them to estimate the aging of microprocessor cores is not trivial, and there is no well-accepted model in the literature. This work proposes a methodology for deriving an NBTI-induced aging model for embedded cores. Since aging can only be determined on a netlist, we use an empirical approach that characterizes the model on a set of open, synthesizable embedded cores, which allows us to establish a link between aging at the transistor level and aging from the core perspective, expressed as maximum-frequency degradation. Using this approach, we are able to (1) prove that aging is independent of the workloads run by the cores, and (2) calculate upper and lower bounds for the “aging factor” that can be used for a generic embedded processor. Results show that our method yields very good accuracy in predicting the frequency degradation of cores due to the NBTI aging effect, and can be used with confidence when the netlist of the cores is not available.

    Modeling of thermally induced skew variations in clock distribution network

    The clock distribution network is sensitive to large thermal gradients on the die, as the performance of both clock buffers and interconnects is affected by temperature. A robust clock network design relies on accurate analysis of clock skew subject to temperature variations. In this work, we address the problem of modeling thermally induced clock skew in nanometer CMOS technologies. The complex thermal behavior of both buffers and interconnects is taken into account. In addition, our characterization of the temperature effect on buffers and interconnects gives designers valuable insight into the potential impact of thermal variations on clock networks. The use of industry-standard data formats at the interface allows our tool to be easily integrated into existing design flows.

    Two-stage Human Activity Recognition on Microcontrollers with Decision Trees and CNNs

    Human Activity Recognition (HAR) has become an increasingly popular task for embedded devices such as smartwatches. Most HAR systems for ultra-low-power devices are based on classic Machine Learning (ML) models, whereas Deep Learning (DL), although reaching state-of-the-art accuracy, is less popular due to its high energy consumption, which poses a significant challenge for battery-operated and resource-constrained devices. In this work, we bridge the gap between on-device HAR and DL with a hierarchical architecture composed of a decision tree (DT) and a one-dimensional Convolutional Neural Network (1D CNN). The two classifiers operate in a cascaded fashion on two different sub-tasks: the DT classifies only the easiest activities, while the CNN deals with the more complex ones. With experiments on a state-of-the-art dataset, targeting a single-core RISC-V MCU, we show that this approach saves up to 67.7% of energy with respect to a 'stand-alone' DL architecture at iso-accuracy. Additionally, the two-stage system either introduces a negligible memory overhead (up to 200 B) or even reduces the total memory occupation.
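
    A minimal Python sketch of such a cascade is shown below; the activity labels, the set of "easy" classes, and both model objects are placeholders, since the actual sub-task split and classifiers are specific to the paper.

        # Stage 1: a cheap decision tree handles the easy activities; stage 2: a
        # 1D CNN handles the rest. Labels, the "easy" set, and the model objects
        # passed in are illustrative placeholders.
        EASY_ACTIVITIES = {"lying", "sitting", "standing"}

        def classify_window(window, dt_model, cnn_model):
            """Return an activity label for one window of inertial-sensor samples."""
            label = dt_model.predict(window)      # low-energy first stage
            if label in EASY_ACTIVITIES:
                return label                      # early exit: the CNN never runs
            return cnn_model.predict(window)      # heavier 1D CNN for hard classes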

    Approximate energy-efficient encoding for serial interfaces

    Serial buses are ubiquitous interconnections in embedded computing systems, used to interface processing elements with peripherals such as sensors, actuators, and I/O controllers. Despite their limited wiring, these off-chip connections can account for a significant fraction of the total power consumption of a system-on-chip device. Encoding the information sent on these buses is the most intuitive and affordable way to reduce their power contribution; moreover, the encoding can be made even more effective by exploiting the fact that many embedded applications can tolerate intermediate approximations without a significant impact on the final quality of results, thus trading accuracy for power consumption. We propose a simple yet very effective approximate encoding for reducing dynamic energy in serial buses. Our approach uses differential encoding as a baseline scheme and extends it with bounded approximations to overcome the intrinsic limitations of differential encoding for data with low temporal correlation. We show that the proposed scheme, in addition to yielding extremely compact codecs, is superior to all state-of-the-art approximate serial encodings over a wide set of traces representing data received from or sent to sensors and actuators.
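
    The Python sketch below illustrates the general idea of a differential encoder whose delta is bounded (saturated); the field width and the saturation policy are assumptions chosen for illustration and are not the proposed codec itself.

        # Baseline: transmit the difference from the previous sample. When the
        # difference does not fit in the delta field, saturate it, introducing a
        # bounded error instead of a full-width update (illustrative widths).
        DELTA_BITS = 4
        DELTA_MAX = (1 << (DELTA_BITS - 1)) - 1
        DELTA_MIN = -(1 << (DELTA_BITS - 1))

        def encode(samples):
            """Turn a stream of sensor samples into saturated deltas."""
            prev, out = 0, []
            for s in samples:
                delta = max(DELTA_MIN, min(DELTA_MAX, s - prev))  # bounded approximation
                out.append(delta)
                prev += delta            # track what the decoder will reconstruct
            return out

        def decode(deltas):
            """Rebuild the (approximate) sample stream from the deltas."""
            prev, out = 0, []
            for d in deltas:
                prev += d
                out.append(prev)
            return out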

    Sequence-To-Sequence Neural Networks Inference on Embedded Processors Using Dynamic Beam Search

    Sequence-to-sequence deep neural networks have become the state of the art for a variety of machine learning applications, ranging from neural machine translation (NMT) to speech recognition. Many mobile and Internet of Things (IoT) applications would benefit from the ability to perform sequence-to-sequence inference directly on embedded devices, thereby reducing the amount of raw data transmitted to the cloud and obtaining benefits in terms of response latency, energy consumption, and security. However, due to the high computational complexity of these models, specific optimization techniques are needed to achieve acceptable performance and energy consumption on single-core embedded processors. In this paper, we present a new optimization technique called dynamic beam search, in which the inference complexity is tuned to the difficulty of the processed input sequence at runtime. Results based on measurements on a real embedded device, with three state-of-the-art deep learning models, show that our method reduces inference time and energy by up to 25% without loss of accuracy.
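
    As a rough illustration of adapting decoding effort to input difficulty, the Python sketch below picks a beam width from a simple confidence heuristic; the heuristic and the thresholds are illustrative assumptions, not the paper's actual policy.

        def step_confidence(log_probs):
            """Gap between the best and second-best token scores at one decoder step."""
            top2 = sorted(log_probs, reverse=True)[:2]
            return top2[0] - top2[1]

        def pick_beam_width(log_probs, min_width=1, max_width=8, threshold=1.0):
            """Use a narrow (cheap) beam when the model is confident, a wide one otherwise."""
            return min_width if step_confidence(log_probs) >= threshold else max_width

        print(pick_beam_width([-0.1, -3.0, -3.2]))   # confident step -> beam of 1
        print(pick_beam_width([-1.1, -1.2, -1.3]))   # ambiguous step -> beam of 8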

    LAPSE: Low-Overhead Adaptive Power Saving and Contrast Enhancement for OLEDs

    Organic Light Emitting Diode (OLED) display panels are becoming increasingly popular, especially in mobile devices; one of the key characteristics of these panels is that their power consumption strongly depends on the displayed image. In this paper we propose LAPSE, a new methodology that concurrently reduces the energy consumed by an OLED display and enhances the contrast of the displayed image, relying on image-specific, pixel-by-pixel transformations. Unlike previous approaches, LAPSE focuses specifically on reducing the overheads required to implement the transformation at runtime. To this end, we propose a transformation that can be executed in real time, either in software with low time overhead, or in a hardware accelerator with a small area and low energy budget. Despite the significant reduction in complexity, we obtain results comparable to those of more complex approaches in terms of power saving and image quality. Moreover, our method makes it easy to explore the full quality-versus-power trade-off by acting on a few basic parameters, thus enabling runtime selection among multiple display quality settings according to the status of the system.
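
    A toy sketch of a parameter-controlled, per-pixel mapping in Python is shown below; the piecewise-linear curve and its knobs are an illustrative stand-in for LAPSE's actual transformation, meant only to show how a few parameters can expose the quality-versus-power trade-off.

        def transform_pixel(v, pivot=128, low_gain=1.1, high_gain=0.8):
            """Map one 8-bit channel value: stretch dark pixels, dim bright ones
            (OLED power grows with pixel intensity). Knob values are illustrative."""
            if v <= pivot:
                out = v * low_gain                                 # contrast boost in dark regions
            else:
                out = pivot * low_gain + (v - pivot) * high_gain   # power saving on bright regions
            return max(0, min(255, int(round(out))))

        def transform_image(pixels, **knobs):
            """Apply the mapping to a flat list of channel values."""
            return [transform_pixel(v, **knobs) for v in pixels]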